Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


The HTSeq-count output file contains the feature count for each sample as shown in

Figure 5.3. The feature count file includes tab-delimited columns for gene symbols, transcript

IDs, and a count column for each sample. Notice that some genes have zero

reads aligned to them. Later, we will filter out the genes that have no aligned reads and

those with low coverage.
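
The filtering idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the header row, the gene and transcript names, and the helper filter_low_counts with its min_total threshold are all hypothetical and are not taken from Figure 5.3 or from any HTSeq function.

```python
import csv
import io

# Hypothetical feature-count table laid out as described in the text:
# gene symbol, transcript ID, then one count column per sample.
# All names and values are invented for illustration.
EXAMPLE_TABLE = (
    "gene\ttranscript\tsample1\tsample2\tsample3\n"
    "GENE_A\tTX_A\t120\t98\t143\n"
    "GENE_B\tTX_B\t0\t0\t0\n"
    "GENE_C\tTX_C\t2\t0\t1\n"
)


def filter_low_counts(table_text, min_total=1, n_id_cols=2):
    """Keep genes whose summed count across all samples is >= min_total.

    n_id_cols is the number of leading identifier columns (assumed to be 2).
    Returns the header row and a list of (gene, counts) pairs.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    kept = []
    for row in reader:
        counts = [int(value) for value in row[n_id_cols:]]
        if sum(counts) >= min_total:
            kept.append((row[0], counts))
    return header, kept


# min_total=1 drops only genes with no aligned reads at all;
# a larger threshold also drops genes with low coverage.
header, kept = filter_low_counts(EXAMPLE_TABLE)
print([gene for gene, _ in kept])
```

With the default threshold only the all-zero gene is removed; raising min_total (for example to 10) would also remove the gene with only three reads in total. The appropriate threshold depends on the experiment and is not prescribed here.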

5.3.5 Normalization

In general, when we analyze gene expression data, we may need to normalize it to avoid

some biases that may arise due to the gene lengths, GC contents, and library sizes (the total

number of reads aligned to all genes in a sample) [21]. The normalization of count data is

important for comparing the expression of genes within a sample and between

different samples. Normalizing by gene length corrects the bias that may affect within-sample

gene expression comparison. It is known that a longer gene would have a higher chance to

be sequenced than a shorter gene. Consequently, a longer gene would have a higher number

of aligned reads than a shorter one at the same gene expression level in the same sample.

The GC content also affects within-sample comparison of gene expressions. The GC-rich

and GC-poor fragments tend to be under-represented in RNA-Seq data, and hence,

a gene with a GC content close to 40% would have a higher chance of being sequenced

[22]. The library size affects the comparison between the expressions of the same gene in

different samples (between-sample effect).
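
The gene-length and library-size effects described above can be illustrated with a toy calculation. The sketch below assumes a deliberately simplified model in which the expected number of reads from a gene is proportional to its transcript copy number multiplied by its length; the function expected_reads and all numbers are hypothetical, and GC content is ignored.

```python
def expected_reads(copies, lengths_bp, library_size):
    """Expected read counts under a toy model (an assumption, not a real
    sequencing model): reads are drawn in proportion to copies * length."""
    weights = [c * l for c, l in zip(copies, lengths_bp)]
    total = sum(weights)
    return [library_size * w / total for w in weights]


# Two genes expressed at the same level (100 transcript copies each),
# but the second gene is four times longer than the first.
copies = [100, 100]
lengths_bp = [1000, 4000]

# Within-sample effect: the longer gene collects four times as many reads.
small_library = expected_reads(copies, lengths_bp, 1_000_000)

# Between-sample effect: doubling the library size doubles every count,
# although the underlying expression has not changed.
large_library = expected_reads(copies, lengths_bp, 2_000_000)

print(small_library)
print(large_library)
```

Under this model, raw counts differ both between the two genes and between the two libraries even though expression is identical, which is the motivation for normalizing by gene length and by library size.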

There are several normalization methods for adjusting the biases resulting from the

above-mentioned possible causes. Choosing the right normalization method depends on

whether the comparison is within-sample or between-samples. In the following, we will

discuss some of the normalization methods used by gene expression analysis programs

such as edgeR and DESeq2 [23].

5.3.5.1 RPKM and FPKM

RPKM [24] (the reads per kilobase of transcript per million reads mapped) is a normalized

unit for the counts of reads aligned to genes (normalized gene expression unit). It scales

the count data by gene length to adjust for the sequencing bias arising from the differences

in gene lengths. The RPKM is used for within-sample gene expression comparison (i.e.,

comparison between genes in the same sample).

Assume that N reads are aligned to the reference sequence and that k reads are aligned

to gene i of length l_i bp. The RPKM of gene i is then calculated as

$$
\mathrm{RPKM}_i = \frac{k}{\left(\dfrac{l_i}{1000}\right)\left(\dfrac{N}{1{,}000{,}000}\right)} \tag{5.1}
$$
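
As a minimal sketch of Formula 5.1, the calculation can be written as a small Python function. The function name rpkm, its parameter names, and the example read counts are assumptions made for illustration; k, l_i, and N follow the definitions given in the text.

```python
def rpkm(k, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    k                  -- reads aligned to the gene
    gene_length_bp     -- gene length l_i in bases
    total_mapped_reads -- total reads N aligned to the reference
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    length_kb = gene_length_bp / 1000                 # bases -> kilobases
    reads_millions = total_mapped_reads / 1_000_000   # reads -> millions
    return k / (length_kb * reads_millions)


# Hypothetical numbers: two genes in the same sample (N = 20 million reads).
# The second gene is twice as long and has twice as many reads.
gene_short = rpkm(k=500, gene_length_bp=2000, total_mapped_reads=20_000_000)
gene_long = rpkm(k=1000, gene_length_bp=4000, total_mapped_reads=20_000_000)

print(gene_short, gene_long)  # both 12.5
```

In this made-up case the longer gene has twice the raw count, yet both genes receive the same RPKM, which is the within-sample length correction the unit is meant to provide.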

In the denominator of Formula 5.1, the gene length (l_i, in bases) is divided by 1000 to

convert it to kilobases, and the total number of reads aligned to the reference sequence is divided

by 1,000,000 (one million). When the number of reads aligned to a gene is divided by the